ANIMAL-SPOT enables animal-independent signal detection and classification using deep learning

Bioacoustic research spans a wide range of biological questions and applications, relying on identification of target species or smaller acoustic units, such as distinct call types. However, manually identifying the signal of interest is time-intensive, error-prone, and becomes unfeasible with large data volumes. Therefore, machine-driven algorithms are increasingly applied to various bioacoustic signal identification challenges. Nevertheless, biologists still have major difficulties trying to transfer existing animal- and/or scenario-related machine learning approaches to their specific animal datasets and scientific questions. This study presents an animal-independent, open-source deep learning framework, along with a detailed user guide. Three signal identification tasks, commonly encountered in bioacoustics research, were investigated: (1) target signal vs. background noise detection, (2) species classification, and (3) call type categorization. ANIMAL-SPOT successfully segmented human-annotated target signals in data volumes representing 10 distinct animal species and 1 additional genus, resulting in a mean test accuracy of 97.9%, together with an average area under the ROC curve (AUC) of 95.9%, when predicting on unseen recordings. Moreover, an average segmentation accuracy and F1-score of 95.4% was achieved on the publicly available BirdVox-Full-Night data corpus. In addition, multi-class species and call type classification resulted in 96.6% and 92.7% accuracy on unseen test data, as well as 95.2% and 88.4% regarding previous animal-specific machine-based detection excerpts. Furthermore, an Unweighted Average Recall (UAR) of 89.3% outperformed the multi-species classification baseline system of the ComParE 2021 Primate Sub-Challenge. Besides animal independence, ANIMAL-SPOT does not rely on expert knowledge or special computing resources, thereby making deep-learning-based bioacoustic signal identification accessible to a broad audience.


Scientific Reports
| (2022) 12:21966 | https://doi.org/10.1038/s41598-022-26429-y www.nature.com/scientificreports/ tions that often resemble the target signal. Consequently, the following classification procedures are conceivable: (1) binary target/noise detection-isolating environmental noise from the taxonomic-dependent animal signals according to the above mentioned data scenarios, or (2) multi-class species/call type recognition-classifying between multiple target species or call types, combined with the illustrated potential data situations. To ensure a robust, animal-independent identification procedure, a binary target/noise detection at the desired taxonomic level (e.g., animal genus) has to be conducted first, to remove noise and other irrelevant animal vocalizations in advance. Taxonomic depth leads to an increasing spectral closeness between signals being represented in the noise class, which results in less distinctive network features separating both classes. Depending on the initial model performance for the chosen taxonomic rank, target/noise data distribution might be restructured with respect to a higher, more generic taxonomic level (e.g. genus to order-level). Based on the respective detection result, subsequent multi-class classification can be conducted with respect to more specific taxonomic ranks, such as animal species (e.g., Blue-winged vs. Golden-winged warbler) or different call types (e.g., monk parakeet alarm, contact, and other calls), ending up in a multi-stage classification procedure. ANIMAL-SPOT is also capable of performing recognition with respect to different species-specific regional differences (dialects) as well as individual identification. Sufficient representative data for dialects of interest or individual-specific vocalizations, in the same way as for the other multi-class classification problems described here, is the only precluding factor. Filtering away noise and other animal vocalizations via the two-step approach enables focus on analysis of regional differences (dialects) and acoustic identification of individuals. Instructions on model configuration and necessary data structure will be further detailed in the user guide 33 with examples.
Animal species and recording setup. In order to show and prove animal independence, 10 different species and 1 additional genus within the chordate phylum were chosen. The overall goal was to test model robustness for as many different habitat types (urban parks, marine reserve, arctic landice), frequency ranges (30 Hz for Atlantic cod to 100,000 Hz for Pygmy pipistrelle), vocalization durations (echolocation sweeps in ms to long roars of multiple seconds), signal-to-noise ratios (urban parks versus noise isolated laboratory), noise characteristics (underwater noise, human narrations, other species), as well as recording setups (passive acoustic monitoring-e.g., Harbour seals-versus focal follows-e.g., Blue-/Golden-winged warblers). A detailed summary regarding all animal-specific recording and data collection setups, utilized within this study, is given in Supplementary Table S1.
Bioacoustic data material. No animals were approached for this study specifically but rather all data used were collected by distinct research teams under their own ethics guidelines. In case of binary detection, each species/genus target and noise was manually annotated. The target class contained only vocalizations produced by the target species/genus (see Table 1). In cases where further sub-classification was envisioned (species level www.nature.com/scientificreports/ for the two warblers- Table 2, call type level regarding monk parakeets- Table 3), these were labeled as well, but all assigned to the target class. The noise class included all other sound segments, such as environmental/ background noise, human narrations, and other animal sounds. While both the number of annotated segments and the class distribution differed for each species, the ratio between vocalization and noise ranges from ≈20% up to ≈57% for all listed data archives. To perform embedded noise augmentation, additional noise segments were provided for some of the species (see Table 1). ANIMAL-SPOT was trained and evaluated in three different experiments: (1) detection between target and noise to separate noise from valuable animal signals, and (2) multi-class species classification, and (3) multi-class call type identification. Besides the annotated detection data   Table 1, three additional unseen recordings were provided for the 10 different species and 1 extra genus, with low, medium, and high appearance of target vocalizations. These were additionally used to validate model performance. In order to prove detection accuracy even further, an additional publicly-available dataset was utilized-the BirdVox-Full-Night data archive-presented by Lostanlen et al. 36 for the evaluation of approaches designed to detect avian flight calls (see Table 1, last row). The original dataset consists of 9.8 h of audio, recorded by six sensors placed in the area around Ithaca, New York, which were manually annotated resulting in 35,402 500 ms-long flight calls of nocturnally migrating birds of about 25 species of passerines. To balance the dataset an equal number of 500-ms-long noise samples were added to the dataset, resulting in 70,804 files (see Table 1, last row). Regarding the BirdVox-Full-Night archive, there were no additional unseen recordings, compared to the remaining data repositories listed in Table 1. Multi-class species classification was conducted between Blue-winged and Golden-winged warbler (see Table 2). In addition the Computational Paralinguistics Challenge Primate (ComParE-PRS) 37,38 dataset was used to distinguish between four different primate species (see Table 4). The dataset includes over 10,000 annotated vocalizations from Chimpanzees (Pan troglodytes), Mandrills (Mandrillus sphinx), Red-capped mangabeys (Cercocebus torquatus), and a mixed group of Guenons (Cercopithecus spp.). Additionally, exactly the same number of noise samples as vocalizations were extracted to make up the noise class 38 . Multi-class call type classification was computed for three main call type classes of monk parakeet vocalizations including alarm, contact, and other calls, listed in Table 3. Across all multi-class scenarios, the existing target class repertoire was extended by an additional noise category to simulate real-world scenarios, as well as cover and handle potential false alarms caused within the first detection stage (see Supplementary Figs. S5 and S6).
Deep learning concepts and network architectures. Convolutional Neural Networks (CNNs) were utilized in order to identify animal vocalizations of interest. A CNN is an end-to-end deep learning architecture based on the principles of pattern recognition including a feature learning and classification component being able to efficiently process the complexity of 2-dimensional input data (e.g., images, spectrograms) 34,40,41 . Convolutional layers are responsible for feature learning, while the classification part is done by the fully connected layers 40 . Convolutional layers embed and represent the following important concepts 34,40 : (1) local receptive fields, (2) shared weights, and (3) subsampling (pooling). Due to the fact that convolutional and pooling layers only compute linear operations, CNNs integrate activation layers (e.g., Rectified Linear Unit 34,42 ) as well as normalization layers (e.g., batch normalization 34,43 ) to handle the non-linearity within the data and to ensure a more stabilized and regularized training procedure 34 . Several repetitive sequentially ordered sequences of convolutional, pooling, normalization, and activation layers lead to extracted and learned features which are used as input for the fully-connected layer projecting the features on the respective output classes 34 . The core concept of the presented deep learning framework is based on a so-called Residual Network (ResNet) 35 . A ResNet is a network architecture, which is built up from different concatenated residual layers 35 . A residual layer is constructed from an arrangement of building blocks which in turn consist of weight (e.g., convolutional, fully-connected), normalization (e.g. batch-norm 43 ), and activation layers (e.g., ReLU 42 ), as well as residual-/skip-connections. Due to this residual-/skip-connection technique it is possible to learn a residual mapping ANIMAL-SPOT. The deep learning framework consists of a ResNet18-based CNN, derived from ORCA-SPOT 34 , our previous killer whale deep detection model, which has been adapted and extended to handle all kinds of vocalizing animals. The initial max-pooling layer within the traditional ResNet18 architecture has been removed to avoid losing too much resolution at the early stage of the training process 34 . Depending on the size of the temporal domain T of the input spectrogram, defined by the chosen training sequence length and corresponding FFT-settings, a 512-large global-averaged pooled feature vector, derived from the 512 × F × T feature maps of the last residual layer (see Fig. 2), is generated and mapped to a subsequent fully-connected layer 34 . In www.nature.com/scientificreports/ order to solve the final n-class classification problem the 512 hidden units of the fully-connected layer are processed onto an output layer consisting of n output nodes depending on the classification task (e.g., two classes for target/noise detection, or multiple classes for species/call type classification). ANIMAL-SPOT is capable of handling any number of output classes, and consequently dealing with multi-class classification scenarios as well. Moreover, ANIMAL-SPOT integrates a refactored version of the entire data parsing and pre-processing pipeline of ORCA-SPOT 34 , next to additional normalization techniques, in order to handle and fulfill all needed prerequisites required for dealing with various animal data sources (see Fig. 1 and Tables 1, 2, 3, 4), in combination with varying classification scenarios. Although advances in neural network structures have been made in recent years, the focus of ANIMAL-SPOT is not a specific type of architecture (e.g., ResNet18, etc.). Instead, the aim of ANIMAL-SPOT is to provide an open-source, animal-independent, and expandable machine learning framework, together with a robust and efficient data preprocessing pipeline. We add support by profound user guidelines to address the broadest possible audience. In addition, the capacity of the model used here is comparatively small, which facilitate researchers who may not have the opportunity to access powerful hardware, to train and evaluate their own animal-and task-specific models. Within the ANIMAL-SPOT framework it is possible to integrate any kind of architectural model designs, allowing the deployment of other novel and userpreferred deep neural network concepts.
Data preprocessing. Independent of the classification scenario all species and their corresponding data repositories (see Tables 1, 2, and 3), followed the same generic data preprocessing pipeline. The core functions are applicable for all animals, however each species requires an animal-specific parameter set (see Supplementary Table S2) in order to guarantee valid data preparation and representation of the corresponding signal characteristics (e.g., typical vocalization duration, frequency range, sampling rate, Fast-Fourier-Transform (FFT) parameters, etc.). The entire preprocessing pipeline of ANIMAL-SPOT consists of the following steps: (1) conversion to mono and re-sampling, (2) Short-Time-Fourier-Transform (STFT) to convert the time signal into a F × T-large power-spectrogram using an animal-characteristic FFT window-length and step-size, where F characterizes the frequency domain and T describes the time domain, (3) integrated and concurrent signal augmentation with respect to the previous derived F × T-large power-spectrogram applying uniformly distributed random scalings including intensity, pitch, and time augmentation within given intervals (see Supplementary  Table S2), where the default interval might slightly vary from species to species, (4) linear frequency compression (nearest neighbor, 256 frequency bins) representing a frequency range between f min and f max , while ignoring other frequency regions, chosen according to the typical spectral vocalization areas of the corresponding animals, resulting in a 256 × T compressed power spectrogram, (5) noise augmentation by adding a pitch-/ time-augmented and frequency-compressed noise spectrogram utilizing a uniformly distributed randomly chosen signal-to-noise ratio (SNR), (6) power-spectrogram conversion to decibel (dB) scale, (7) 0/1-min/max-or 0/1-dB-normalization, either using the spectral minimum and maximum, or applying a minimum and reference decibel level, dependent on the respective target species, to normalize the spectral envelope, and (8) random sub-sampling or zero-padding of the spectrogram according to the chosen sequence length, leading to a final 256 × T augmented and normalized spectral clip being used as network input. Figure 3 visualizes example network input spectrograms for each species, preprocessed according to the illustrated pipeline.
Network training and evaluation. Due to ANIMAL-SPOT's ResNet18-based feature extraction and compression path (see Fig. 2), each input spectrogram is compressed by a factor of 16 during encoding, both in time T and in frequency F domain. The remaining F × T features for each of the 512 channels are mapped to the corresponding fully connected layer, conducting global average pooling, followed by a projection to the number of parametrizable output nodes/classes. During training, random data augmentation and sub-sampling/ padding, can be enabled. However, to compare validation and test set results across various models, random data augmentation and sub-sampling/padding was disabled. Validation and test samples were centered and either zero-padded or sub-sampled, in case the original length did not match the chosen sequence length 34 . ANIMAL-SPOT was implemented in PyTorch 17 using a cross entropy loss in combination with a batch-size of 8 for all animals, together with an Adam optimizer applying an initial learning of 10 −5 , β 1 = 0.5, and β 2 = 0.999. Additionally, ANIMAL-SPOT integrates a learning rate decay of 1/2 after 4 epochs without any improvement on the validation set. The training was stopped after an animal-specific number of epochs (see Supplementary  Table S2) if no improvement was achieved on the validation set (early stopping). The accuracy was chosen as an appropriate network validation criterion. ANIMAL-SPOT integrates an intelligent data split mechanism, capable of automatically identifying all class labels, assuming that data preparation was performed in the prescribed format 33 , and ensures that samples of a particular recording are only present in one of the splits. By default, the data split is 70% for training, 15% validation, and 15% test. However, it may differ depending on the original data distribution in combination with the above mentioned recording restriction (see Supplementary Tables S4-S6).
Regarding the two challenge datasets-BirdVox-Full-Night 36 and ComParE-PRS 37,38 -the original predefined data splits have been applied for reasons of comparison. Network training and evaluation was computed utilizing mid-range graphics processing units (GPUs) (e.g., Nvidia GTX 1080), as well as standard central processing units (CPUs), showing the broad applicability of the training setup. Supplementary Table S2 reports all animalspecific network-hyperparameters. ANIMAL-SPOT's network performance was evaluated via the following experimental constellations: (1) animal-and scenario-specific model evaluation for target/noise detection (see Supplementary Table S3) and multi-class classification (see Supplementary Tables S4, S5, S6), reporting various performance metrics regarding training, validation, and unseen test set, (2) evaluation of animal-specific target/noise detection networks on three fully-annotated unseen test recordings, performing a sliding window approach in combination with a given window-length ǫ and step-size κ to frame-wise segment between target and noise, and (3) inspection and verification of the multi-class classification models for warblers and monk parakeets based on the machinesegmented and extracted signal parts of step 2, generated by the corresponding segmentation models.
The first evaluation scenario visualizes the following training, validation, and test metrics: accuracy (ACC), true-positive-rate (TPR), false-positive-rate (FPR), precision (PREC), F1-score (F1), and area under the ROC curve (AUC). In case of the ComParE-PRS 37,38 primate species recognition challenge, only the unweighted average recall (UAR) was reported due to comparability reasons.
The second evaluation procedure classifies audio sections for each of the three unseen, animal-specific test recordings, depending on the defined window-length ǫ and step-size κ , affecting the signal overlap, as a whole. Machine-predicted audio chunks are compared frame-by-frame with the ground truth 34 . Each predicted frame/ segment of the unseen recordings, together with the respective frame-wise network confidence (probability), allow to present the ROC-curve 44 and its respective AUC 34 . In addition, frames showing a larger value than a model confidence δ , are transformed into an annotation with its corresponding start and end time. Therefore, successive frames of the same label (noise = 0 or target = 1) are concatenated and extracted as one annotation excerpt 34 . Frame-wise smoothing was used to mark classified noise segments as target frames if the neighboring signal chunks are exclusively labeled as target signals 34 . Neighboring frames are frames which include preceding or subsequent signal content of the current sound segment because of the respective overlap 34 . ANIMAL-SPOT-S refers to the smoothed version, whereas ANIMAL-SPOT corresponds to the non-smoothed variant (see Supplementary Figs. S2-S12). Additionally, the predicted, smoothed, and extracted network detections with an exemplary model confidence of δ ≥ 50% and δ ≥ 90% , were used to calculate and report time-wise precision (PREC) versus corresponding recall (TPR), in order to show intersection accuracy between machineand human-annotated labels. To calculate time-based precision and recall all ground truth annotations, which are not further apart than a merging factor ξ = ǫ 2 (half of the prediction window in seconds), were combined to one annotation, since such cases lead to sliding windows ǫ , while at least half of the window contains animal vocalizations. In case of time-wise precision calculation an additional overlapping factor = ǫ 2 was introduced,  www.nature.com/scientificreports/ extending the ground truth annotation start and end accordingly (start − , end + ), covering overlapping predictions at the annotation borders. The third and last evaluation scenario reports results on multi-class classification by presenting the following evaluation criteria on training, validation, and test data: (1) accuracy, (2) confusion matrix, and (3) UAR (only for the ComParE-PRS 37,38 dataset). Additionally, the model was evaluated on the corresponding machine-annotated warbler and monk parakeet results, utilizing the same sliding window approach. However, during prediction of the multiple classes, a noise identification was only considered as correct if the network confidence was higher than > 85%, due to the assumption that the previous detection process has a low false positive rate. If the network's confidence regarding noise was lower than this boundary, but still the highest probability, it was ignored and the second largest confidence value was chosen as correct prediction. The chosen window length, in combination with the input file duration, ends up in two potential cases: (1) window length is larger than the input file duration, leading to an updated window equal to the pre-segmented audio clip, and (2) window size is shorter than the input file size, leading to a sliding, frame-wise classification approach determined by window-and step-size, while only considering full windows in order to avoid potential misclassification. If multiple sequentially-ordered frame-wise classifications per pre-segmented file exist, probabilities of each frame and predicted class are summed up cumulatively. Finally, the class providing the largest probability mass was selected. All detection and multiclass species/call type classification metrics are visualized and illustrated in Figs. 4, 5, 6, as well as Supplementary  Table S7 and Supplementary Figs. S2-S12.
ANIMAL-SPOT guide. The ANIMAL-SPOT Guide 33 is a step-by-step and detailed user guide, publicly available together with the source code 33 , which enables researchers to train and evaluate animal-specific deep neural networks on their own bioacoustic data corpora (see Supplementary Fig. S1). The guidelines involve: (1) operating-system independent installation, data preparation, and detailed documentation of the ANIMAL-SPOT source code 33 , (2) instructions and guidance in order to set up, train, and evaluate animal-and scenariospecific architectures, as well as (3) use-case dependent prediction of unseen data material utilizing stand-alone noise/target detection models, species/call type classification networks, or a combined version of detection and subsequent classification (see Supplementary Fig. S1). The ANIMAL-SPOT guide provides a detailed description with respect to the following three scenarios: (1) single-stage detection between animal vocalizations of interest and noise, based on unseen data, (2) single-stage classification of animal species and/or specific call types directly on unseen raw audio material, and (3) a combined version of step 1 and 2 by firstly pre-segmenting unseen audio recordings, followed by subsequent classification (animal species, call types, etc.), while taking only the respective pre-segmented target vocalizations as input.

Experiments
Animal-species target/noise segmentation. In a first experiment animal-species segmentation was performed for all animal-specific (see Fig. 1) data volumes listed in Table 1. Data partitioning was conducted for each animal-specific data archive (see Supplementary Table S3), whereas the training set comprises ≈70%, validation and test set each ≈15% of the total labeled data corpora listed in Table 1. Using the respective data distributions in combination with the animal-specific network-hyper-parameters presented in Supplementary Table S2, different ANIMAL-SPOT architectures were trained and evaluated according to the previously described network training and evaluation procedure. An exception is the BirdVox-Full-Night 36 challenge dataset (see Table 1, last row), which used the same data distribution, training, and evaluation procedure as described in Lostanlen et al. 36 , in order to allow a meaningful comparison with the original results. Consequently, a leaveone-out testing procedure whereby one unit of the given six was utilized for testing and the other five were used for training and validation. This results in exactly the same data split as reported, but also means that there exist no additional and unseen data available for further evaluation, as it was the case for all other data corpora listed  Multi-class species and call type classification. As baseline for the second experiment, results of the first target/noise (binary) animal-species segmentation were utilized, next to a pure stand-alone multi-species   www.nature.com/scientificreports/ classification scenario without pre-segmentation. To demonstrate, prove and verify performance of the proposed multi-step classification procedure, three of the 12 animal species (see Fig. 1) were utilized-Golden-/Bluewinged warblers (genus) and monk parakeets. In case of the warblers a subsequent multi-class classification model, trained on the data and distribution listed in Supplementary Table S4, was used to further separate the genus-specific and previously machine-segmented data pool into Golden-winged, Blue-winged warblers, other birds, and noise. To counteract possible false alarms from the segmentation phase, two classes-other birds and noise-were introduced besides the corresponding signals of interest.
The proposed multi-stage approach was further evaluated training a multi-class classification network to differentiate between various monk parakeet call types, using the data and distribution in Supplementary Table S5, including contact, alarm, and other calls, as well as noise, in order to handle previous segmentation errors.
In case of the golden-and blue-winged warblers, a total of 210 machine-annotated audio segments were extracted utilizing a network confidence of ≥ 90%. Under identical conditions 103 monk parakeet machine segmentations were predicted and extracted. Example spectrograms for golden-/blue-winged warbler vocalizations, as well as for the various monk parakeet call types, are visualized in Figs. 5 and 6.
In order to assess the efficacy of the multi-stage approach, instead of a single-stage multi-class approach, a multi-class model was exemplarily trained to perform detection and classification of Golden-winged warbler, Blue-winged warbler, and noise (pure background noise, other birds) in one step, using exactly the same three unseen, manually labeled recordings for evaluation as during the detection phase within the multi-stage procedure (see Table 1). The three audio files contain either: (1) only Golden-winged warblers, (2) only Blue-winged warblers, (3) a combination of both warbler types. The model used the same data distribution for Golden-/ Blue-winged warbler as stated in Supplementary Table S4, together with the warbler noise distribution listed in Supplementary Table S3.
Nevertheless, in order to also show and demonstrate the possibility of directly training a multi-species classification network without previous segmentation, the ComParE-PRS dataset (see Table 4) was used to distinguish between 4 different primate species as well as background noise, trained on the given data distribution listed in Supplementary Table S6.

Results
Animal-species target/noise segmentation. ANIMAL-SPOT successfully segmented all 10 target species, as well as the additional genus, leading to an overall mean test set accuracy of 97.9% (range: 94.5-99.8%). Additionally, an average area under the ROC curve (AUC) of 95.9% (range: 91.7-99.1%) across all 33 unseen animal-specific recordings (3 tapes per detection scenario) was achieved (see Supplementary Table S7. Besides network generalization on the unseen tapes, a detailed performance overview with respect to model training, validation, and testing is reported and visualized in Supplementary Figs. S2-S12. In addition, all detection results are summarized and available in Supplementary Table S7. Moreover, Fig. 4 summarizes detection results in a compressed version, averaged across all 11 animal-specific segmentation models (10 different species, 1 additional genus), visualizing: (1) network performance metrics based on the animal-specific human-annotated training, validation, and testing repositories (see Supplementary Table S3, Fig. 4-1.1), (2) model results across all 11 averaged Receiver-Operating-Characteristics 44 (ROC) curves by visualizing 2 out of 11 curves, indicating the minimum and maximum AUC, spanning the average AUC-range where all other remaining ROC-curves are located (Fig. 4-1.2), and (3) network output across all 11 averaged and threshold-dependent precision/ recall scores (Fig. 4-1.3). Apart from the segmentation results of all 11 animal-specific segmentation models (10 different species, 1 additional genus), ANIMAL-SPOT also successfully processed the BirdVox-Full-Night 36 dataset. In comparison to the results given by Lostanlen et al. 36 , with the best performance coming from a CNN with noise augmentation, which resulted in an average accuracy of 94.9% and an average F1-Score of 62.7%, ANIMAL-SPOT achieved a slightly better average accuracy of 95.4% and a significantly better F1-Score of 95.4%. These results were achieved by training 10 models for each unit and taking the average of the results when removing the best and worst two performing models, resulting in an average over six models for each unit.

Multi-class species and call type classification.
Multi-class species classification was applied to the previous warbler target/noise detection results (see Supplementary Fig. S6), in order to further separate the machine-segmented warbler species vocalizations, exemplarily visualized in Fig. 5-1.1,1.2, into Blue-winged and Golden-winged warbler, resulting in a multi-class (4-classes) species identification scenario. Therefore, ANIMAL-SPOT, trained in a multi-class species classification scenario using the data listed in Supplementary  Table S4, achieved an overall accuracy of 96.6% for the human-labeled unseen test set, as well as 95.2% with respect to the total number of 210 previously machine-detected audio segments (see Supplementary Fig. S6). Moreover, training and validation accuracy is shown besides two confusion matrices (4 classes), visualizing the aforementioned results achieved on the respective unseen human-labeled and machine-segmented test corpora (see Fig. 5-1.3-1.5, Supplementary Fig. S6).
Compared to the results of the proposed two-stage approach, which includes target/noise detection and downstream multi-class identification, the single-stage method, which performs both, detection and classification, in a multi-class model at once, performed significantly worse. Across all three unseen warbler recordings the Golden-/ Blue-winged warbler detection model (threshold δ ≥ 0.9) identified a total of 210 potential vocalizations of interest, resulting in a time-based precision of 95.3% (see Supplementary Fig. S6). All 210 of the segmented samples were then used for downstream multi-class classification. In comparison, the 3-class single-stage approach detected 233 warbler events, with 200 true predictions resulting in a sample-based precision of 85.8%, whereas just 77.3% (180 out of 233 vocal events) were detected and classified as the correct warbler species. In the case www.nature.com/scientificreports/ of the two-stage approach multi-species classification achieved an accuracy of 95.2% (see Fig. 5). This indicates that a two-step approach, where the network can focus more on the distinguishing features of the individual bird species without also having to filter out as much noise, is preferable to a single-step approach. Besides all the results regarding warbler-species classification, ANIMAL-SPOT was also successfully deployed to identify various primate species using the Computational Paralinguistics Challenge Primate (ComParE-PRS) 37,38 data archive. The initial challenge investigation utilizes five different approaches for feature extraction and classification of primate vocalizations, namely openSMILE 45 , openXBOW 46 , DeepSpectrum 47 , AuDeep 48 , and End2You 49 in conjunction with either an SVM (openSMILE, openXBOW, DeepSpectrum, AuDeep) or a recurrent neural network (RNN) (End2You) for the final classification. The initial baseline for the challenge, calculated by majority voting using the best configuration for each approach, reported the best unweighted average recall (UAR) of 87.5% 37,38 . In comparison, ANIMAL-SPOT outperformed the baseline achieving a UAR of 89.3%. Multi-class call type classification was applied to the previous monk parakeet target/noise detection results (see Supplementary Fig. S5), similarly to the warblers, however, with the aim to classify between different call types visualized in Fig. 6-1.1-1.3, leading to a multi-class (4 classes) monk parakeet call type classification scenario. For this purpose, ANIMAL-SPOT was trained on the data listed in Supplementary Table S5. The final model achieved an overall test set accuracy of 92.7%, compared to 88.4% on the previous machine-based detection results (see Fig. 6-1.4-1.6, Supplementary Fig. S5).

Discussion and future outlook
In total, 10 different species and 1 extra genus (see Figs. 1, 4, Supplementary Figs. S2-S12), as well as the publiclyavailable BirdVox-Full-Night 36 dataset, were analyzed in a binary detection scenario in order to prove ANIMAL-SPOT's ability to generalize across a wide variety of sound-types and to assess the feasibility of the proposed multi-stage detection/classification pipeline (see Figs. 5 and 6). As the results on the unseen recordings prove, promising time-wise and threshold-dependent recall/precision values were achieved, indicating an accurate intersection between ANIMAL-SPOT's predictions and the actual ground truth (see Fig. 4 Table S7 and Supplementary Figs. S2-S12). In addition, ROC-curves and corresponding AUC values show a significant reduction of the species-dependent and original noise-heavy data material (see Fig. 4-1.2). Thus, threshold-dependent recall and false-positive-rate combinations can be derived according to the respective usecase, which in turn considerably speeds up and improves downstream data analysis. Furthermore, the combined strong results seen in both unseen test set as well as unseen real-world recordings, suggest no indication of model overfitting and prove network generalization across all different animal species.

-1.3, Supplementary
The improvements with respect to the publicly available BirdVox-Full-Night 36 dataset are also very promising, as the detection accuracy was improved by 0.6%, which indicates an error reduction of about 12%, besides a significant improvement of 32.7% regarding the F1-Score.
ANIMAL-SPOT integrates a large repertoire of distinct parameterization options for setting up data preprocessing and network training (see Supplementary Table S2). Thus, ANIMAL-SPOT performs equally across wide ranges of temporal contexts (e.g average vocalization duration of Pygmy pipistrelles compared to killer whales), frequency ranges (e.g low-range Atlantic cod and Harbour seal vocalizations, mid-range bird sounds, and ultrasound bat signals), as well as spectral patterns (e.g., pulse-like structure of the Harbour seal or warbler signals and harmonic properties of the killer whale, Atlantic cod, and chimpanzee vocalizations). It is even possible to learn and distinguish between spectral call structures which are very similar to noise, seen in ANIMAL-SPOT's exemplary ability to distinguish Sulphur-crested cockatoo, Harbour seal, monk parakeet, and chimpanzee vocalizations from very similar background noise. In case of binary target/noise detection, ANIMAL-SPOT is especially useful in recording situations where the noise characteristics are an order of magnitude larger than the amount of valuable animal vocalizations.
ANIMAL-SPOT's parameterization capacity also enables flexible adaptations regarding model architecture, data preprocessing, and network training/evaluation, allowing researchers to address and answer various specific bioacoustic research questions. Furthermore, the binary-class target/noise detection process enables researchers to separate target species that show poor results in the single-stage binary target/noise detection scenario. This can occur especially when the target species spectrally resemble other vocalizing species that are also found in the unseen recordings. In such situations the primary focus is on a generic distinction between target vocalizations and superfluous noise, making subtle spectral differences of other species difficult to model, because of generalization properties across both classes leading to increasing mis-classifications. This phenomenon was observed in case of Blue-winged and Golden-winged warblers, after both were individually trained and analyzed on species level, which demonstrated significant performance variations. However, using a two-step identification scenario consisting of target/noise detection at genus level (see Supplementary Fig. S6), and subsequent multi-class species classification, ANIMAL-SPOT achieved an overall test set accuracy of 96.6% on unseen test data, which had been labeled by a human expert, and an accuracy of 95.2% on the target detections identified by ANIMAL-SPOT in the target/noise detection scenario (see Fig. 5).
In addition, the same two-step approach was successfully applied to distinguish and classify between different monk parakeet call types, resulting in 92.7% test set accuracy for human-annotated samples, and 88.4% with respect to the machine-performed detection results (see Supplementary Fig. S5, see Fig. 6). These combined results demonstrate the wide range of biological scenarios which can be covered by ANIMAL-SPOT in combination with user-and animal-specific data material. In both multi-class classification scenarios-warbler species and monk parakeet call types-ANIMAL-SPOT extracts and classifies centered signal sections of the unseen network test set samples according to the training sequence length (see Supplementary Table S2,  . Furthermore, sequence length and step-size are very important parameters to guarantee robust predictions. In terms of species and/or call type classification the step-size should be an order of magnitude smaller than the sequence length, in order to guarantee sufficient overlap and not to miss important spectral features during the prediction phase. ANIMAL-SPOT demonstrated also great results in terms of single-stage multi-species primate classification, by outperforming the ComParE baseline system. The final result of 89.3% also exceeds the UAR of 88.3% presented by challenge competitor Illium et al. 50 , who applied a vision transformer to the classification problem. Müller et al. 51 report the same UAR of 89.3% while applying a Deep Recurrent Neural Network. The remaining competitors who performed better than ANIMAL-SPOT utilized either ensembling of multiple classifiers, as in the case of Egas-López et al. 52 , who achieved a UAR of 89.8%, or data augmentation techniques such as SpecAugment or MixUp and training tricks such as exponential moving average of the model weights, as presented by Thomas Pellegrini 53 , who achieved a UAR of 92.5% on the test set. ANIMAL-SPOT is therefore placed squarely in the middle of the top challenge performers despite using only a single, relatively simple classifier and basic augmentation techniques. In order to robustly train and report promising results, data volume, distribution, and variation is crucial. Moreover, the data corpus must be representative with respect to unseen real-world data. If these criteria are not fulfilled, models often lead to significantly worse results, despite promising training, validation, and test metrics. In order to enlarge data variation, especially for small animal corpora, various embedded spectral augmentations were computed (see Supplementary Table S2). However, such augmentation variants and corresponding values must be determined independently for each animal species and can therefore not be generalized. In particular, noise augmentation must be applied carefully, because of differing Signal-to-Noise-Ratio (SNR) between the original sounds and utilized noise data, particularly in case of animal vocalizations being very similar to noise data (e.g., Sulphur-crested cockatoo, distant chimpanzee pant-hoot versus bird vocalizations in the same frequency range). Therefore, it is essential to ensure that noise samples, chosen for augmentation, are representative and independent from training, validation, and test noise excerpts. Despite promising scenario-and animalspecific results on the unseen test data, audio recordings, machine-driven pre-detections, and challenge datasets (see Figs. 4, 5, 6, Supplementary Figs. S2-S12, and Supplementary Table S7), the performance may still vary to a certain extent, due to the following reasons: (1) non-representative data and/or insufficient training data, (2) recording artifacts introducing spectral outliers which are difficult to interpret by the network, (3) other animal vocalizations or noise characteristics showing a similar spectral envelope as the target sounds, (4) strong deviation of the signal intensities compared to the chosen reference and minimum dB-values of the 0/1-dB-normalization during training (see Supplementary Table S2), (5) overlapping animal signals and human narrations, (6) vocalization types of a given species which have significant spectral and temporal differences between each other, and (7) window-length ǫ and step-size κ used during prediction phase. Figure 7 visualizes different examples of such animal-and task-specific misclassifications, caused by the previously illustrated error sources, which significantly influence network prediction results.  www.nature.com/scientificreports/ Furthermore, during multi-class classification, special attention needs to be paid towards correct machinebased detection outputs indicating one of the following scenarios: (1) multiple vocalizations of the same and/ or different species/call types within a single segment, (2) truncated signals, either at the beginning or end of a segment, and (3) overlapping vocalizations. Examples of the above mentioned and remaining challenges are visualized in Fig. 8.
In addition, data collection should be conducted via a consistent recording setup. Including data material originating from varying recording environments and/or setups will result in spurious outputs unless sufficient examples of this variation is represented in the training and validation datasets.
ANIMAL-SPOT's performance and network training stabilization strongly correlates with the chosen hyperparameter setup, respective data structure and distribution, as well as model initialization. In order to identify the best fitting training setup for a certain species and classification procedure, a parametric search within the target-specific value range (with regards to e.g., signal frequency range, average sound duration, type of vocalization) should be performed. Additionally suitable prediction settings-window length ǫ and step-size κ-as well as network parameters (see Supplementary Table S2) are very important. Window length ǫ has to be approximately in the same dimension as the network training length. To ensure adequate prediction settings, ANIMAL-SPOT should be evaluated on a small portion of unseen manually labeled recordings, before processing large unseen data archives. Moreover, network initialization, as well as random augmentations during training, may impact network performance, especially in case of small training corpora, affecting final model performance despite similar training, validation, and test set metrics.
Researchers face various acoustic detection scenarios, namely simple target/noise segmentation, identification of target signals among other distinct animal-specific vocalizations, and the recognition of target vocalizations among other similar animal-related vocalizations. All these scenarios can be addressed by the ANIMAL-SPOT framework and its underlying methods. For the simple case of identifying a target signal among nondescript background noise a simple one-step procedure can be applied as well as utilization of the framework-supplied noise augmentation to account for differences in signal-to-noise ratios in varying real-world conditions. Similarly, in the case where the target signal is dissimilar to the to other known vocalizations, a one-step model application procedure can be applied, and the classification is altered from a binary target-noise scenario to a multi-class problem which includes vocalizations from other known species present in the recordings. Finally, when dealing with the scenario in which the target vocalization exhibits similar characteristics such as to make them difficult to discern from each other, a multi-step approach can be taken, as was shown when attempting to accurately distinguish between Blue and Golden-winged warblers (see Fig. 5) or different monk parakeet call types (see Fig. 6). The same applies to the recognition of species-specific dialects and single individuals. The first task is to eliminate to the fullest extent the background noise (pure noise, other dissimilar animal vocalizations) from the classification problem. After background noise is removed from the data, it appears that the model is more capable of distinguishing between similar acoustic features through the focus on other spectral characteristics and features. Note that, due to the relatively small model sizes used here, a two-step approach could also be applied to the case where vocalizations are dissimilar without incurring a significant penalty with respect to computation time.
Besides the animal-independent target/noise and multi-step/class identification results (see Figs. 4, 5, and 6), this study also puts special emphasis on the proposed ANIMAL-SPOT guide 33 (see Supplementary Fig. S1), which enables researchers to setup their own user-specific deep learning framework, without the need of prior machinelearning knowledge. The ANIMAL-SPOT guide 33 describes the entire software framework from beginning to end, including OS-specific installation manuals regarding all necessary software components, data preparation and processing guidelines, as well as detailed descriptions on how to setup, train, and run the final network prediction/evaluation on unseen data (see Supplementary Fig. S1).
The entire deep learning framework, as well as user-and animal-specific setup, can be verified and evaluated through the additionally provided example data archive on monk parakeets 54 , which is publicly available 33 , next to all the source code and user-friendly instruction manual. This guide enables the bioacoustic community to independently train/evaluate task-and animal-specific deep models in order to gain deeper insights into animal communication and understanding. www.nature.com/scientificreports/ Many different fields of potential future follow-up work arise, such as (1) animal-specific representation 55 and/or transfer learning, utilizing larger labeled/unlabeled bioacoustic data corpora and/or other data archives (e.g., ImageNet 56 ), (2) investigation regarding various deep network architectures (e.g., CNN-LSTM 25 , ResNeXt 57 , Inception/Inception-ResNet 58 , or Transformer-based approaches 59 ), and (3) animal-independent signal enhancement/denoising 60 , acting as additional data preprocessing option. To the best of the authors knowledge, ANI-MAL-SPOT is the first open-source 33 machine learning approach, capable of handling various bioacoustic signal identification scenarios (binary target/noise detection, multi-class species/call type classification), verified on a wide portfolio of animal vocalizations from different animal taxa and challenge datasets. In combination with a detailed user guide, ANIMAL-SPOT allows the broader bioacoustic research community to develop their own task-specific deep neural networks, on virtually any animal species.

Data availibility
The acoustic data archives supporting the findings of this study are available from the respective data owners upon reasonable request. Contact details can be obtained from the corresponding author. Upon acceptance, the code for ANIMAL-SPOT, besides the entire ANIMAL-SPOT guidelines, all together with an example data corpus 54 , will be made publicly available at https:// github. com/ Chris tianB ergler.